Abstract

Microprocessors are designed to provide good aver- 
age performance over a variety of workloads. This can 
lead to inefficiencies both in power and performance for 
individual programs and during individual phases within 
the same program. Microarchitectures with multi-configuration
units (e.g. caches, predictors, instruction windows) are able to
adapt dynamically to program behavior
and enable/disable resources as needed. A key
element of existing configuration algorithms is adjusting 
to program phase changes. This is typically done by "tuning"
when a phase change is detected, i.e. sequencing
through a series of trial configurations and selecting the 
best. 

Algorithms that dynamically collect and analyze pro- 
gram working set information are studied. To make this 
practical, we propose working set signatures - highly 
compressed working set representations (e.g. 32-128
bytes total). Algorithms use working set signatures to 1) 
detect working set changes and trigger re-tuning; 2) identify
recurring working sets and re-install saved optimal
reconfigurations, thus avoiding the time-consuming tun- 
ing process; 3) estimate working set sizes to configure 
caches directly to the proper size, also avoiding the tun- 
ing process. Multi-configuration instruction caches are 
used to demonstrate the performance of the proposed 
algorithms. When applied to reconfigurable instruction
caches, an algorithm that identifies recurring phases 
achieves power savings and performance similar to the 
best algorithm reported to date, but with orders-of- 
magnitude savings in the number of re-tunings. 



1. Introduction 

As microarchitecture and chip technology evolve, 
tradeoffs involving performance, power, and complexity 
become increasingly difficult, and optimization methods 
become increasingly sophisticated. One promising optimization
method is to configure microarchitecture features
dynamically to adapt to changing program
characteristics [1-13]. As a program runs, it passes 
through phases of execution where its performance 
characteristics and, consequently, its hardware resource 
requirements may vary [14, 15]. Performance and/or 
power consumption can be optimized on-the-fly if significant
phase changes can be detected and dynamic
microarchitecture reconfiguration can be invoked in re- 
sponse to the phase changes. 

In most proposed implementations, configurable units 
are designed to have a number of fixed configurations, 
e.g. four different cache sizes. Then, the runtime configuration
algorithm selects one of the multiple available
configurations. Thus far, algorithms for determining the 
optimal hardware configuration have primarily been ad 
hoc, and consequently, there are about as many algorithms
as there are proposals for multi-configuration units.



The research reported here is directed primarily toward 
development of configuration algorithms rather than de- 
veloping new types of multi-configuration units. The goal 
is to find fundamental techniques that can be applied
across a broad range of units. These algorithms will not 
only improve performance of individual multi- 
configuration units, but also permit unified control of sev- 
eral such units simultaneously. We envision these algo- 
rithms being implemented with co-designed virtual ma- 
chine software [16], but that aspect is not essential to the 
research presented here; hardware or conventional software
implementations could also be used.

As a basis for constructing reconfiguration algorithms, 
we are studying dynamic analysis of program working 
sets. There are three aspects of working sets that are of 
interest. Detection of a working set change indicates a
program phase change, and can be used to trigger a search 
for an optimal configuration. Working set size can be 
used directly to choose the optimal configurations when 
performance is directly related to working set size (e.g. 
caches). Finally, the working set identity can be used to 
reduce re-optimization overhead: when a previously encountered
working set can be identified, the optimal configuration
for that working set can be stored and reinstated.

Working sets can be quite large, and it is likely impractical
to work with full representations of working sets.
Consequently, we propose a small hardware table (on the
order of 32-128 bytes) to capture a working set "signature"
that contains enough information to permit an estimation
of the important working set characteristics. This
working set information can be incorporated into a number
of reconfiguration algorithms, and we demonstrate the
use of working set signatures for multi-configuration
instruction caches.

In the next three subsections, we summarize proposed 
methods for dynamically configuring hardware, describe 
reconfiguration algorithms, and discuss ways program 
working set behavior can be used in configuration algo- 
rithms. 

1.1 Dynamically configurable hardware 

A number of proposals have been made for adaptive/configurable
hardware mechanisms targeted at performance
and/or power optimization. A few important
examples follow.

• Configurable caches and TLBs - line sizes and associativity
are adjusted in response to program referencing
behavior [2, 3, 5].

• Allocation of memory hierarchy resources - cache
memory resources are divided among levels in the
cache hierarchy [4] or configured for other uses,
e.g. instruction reuse [6].

• Allocation of memory buffer resources - the same
buffer resources are used for stream buffers or victim
buffers, depending on the current needs of the
program [3].

• Configurable branch predictors - the length of the
global history register [7] in a gshare (or related)
predictor is varied.

• Configurable instruction windows - sections of the 
issue window are disabled when there is low in- 
struction level parallelism [8, 9]. 

• Configurable pipelines - portions of clustered mi- 
croarchitectures can be disabled [10], or a pipeline 
can vary between in order, out-of-order, and pipe- 
line gating [11]. 

Of course, these various methods are not mutually ex- 
clusive, and in practice a combination of adaptive tech- 
niques will likely be used in the same processor. This 
leads to a fairly complex optimization problem, especially 
if the methods interact with one another. Huang et al.
[12] describe a general framework and algorithms that
are intended to deal with processors containing several 
configurable units. 



1.2 Dynamic reconfiguration algorithms

Methods for controlling multi-configuration hardware 
generally involve a form of feedback where some performance
characteristic (e.g. instructions per cycle (IPC)
or miss rate) is measured and reconfiguration decisions
are based on current and past measurements. The more
sophisticated optimization schemes run for a fixed interval
(also called a "window", "step", etc.) while monitoring
some performance or program characteristic. This
information is used to determine whether there has been a 
program phase change. If so, the configuration algorithm 
undertakes a tuning sequence, i.e. it systematically tries a 
number of configurations and measures the performance 
of each. It then selects the optimal one and continues, 
waiting for the next phase change. 

The algorithm shown in Fig. 1 is proposed in [4]. This 
algorithm is both one of the better documented and the 
best performing we have found; henceforth, we use this 
algorithm for comparisons and refer to it as the Rochester 
algorithm. In [4], it is used to control a multi-configuration
data cache hierarchy. That system repeatedly
runs for a fixed number of instructions (100,000),
and then makes a pass through the algorithm given in the 
figure. The system has two states: STABLE and 
UNSTABLE. As long as the configurable unit's performance,
unit_perf, does not change more than the
perf_noise level and the number of branches does not
change more than the br_noise level, the phase is STABLE
and nothing is done. Otherwise, the phase is considered to
be UNSTABLE, and the algorithm goes through a tuning 
sequence, looking for the best configuration. It begins 
with the smallest configuration and goes to the largest, 
unless the performance exceeds the threshold. Then, the 
algorithm selects the best performing configuration, 
makes the system state STABLE, and continues. If the 
tuning process selects the same configuration as in the 
previous phase, the noise levels are increased to prevent 
unnecessary tunings in the future. When stable, the noise 
thresholds are reduced until they reach a minimum level; 
in essence, the algorithm dynamically changes the thresh- 
old in order to detect major phase changes. 

Reconfiguration algorithms have three basic properties 
that determine their applicability and effectiveness. 

Detection efficiency - the ability of an algorithm to de- 
tect program phase changes. Low detection efficiency can 
lead to lost reconfiguration opportunities and non-optimal 
hardware configurations. 

Reconfiguration overhead - the overhead associated 
with the transition from one configuration to another. The 
reconfiguration overhead depends on the amount of state 
contained in the structure. Flushing and/or re-learning the 
state can take tens of cycles to thousands of cycles (e.g. for
reconfiguring a data cache). 

Tuning overhead - the time spent searching for an optimal
configuration. A high tuning overhead leads to a
higher number of reconfigurations and more time spent in
the non-optimal configurations. This is a more serious 
problem in microarchitectures with several multi- 
configuration units. For example, three units with three 
configurations each, can lead to up to 27 combinations to 
explore (depending on the degree to which they interact). 
In a proposed method for resizing global branch history 
[7], up to 16 different configurations are explored. 



Figure 1. An algorithm that detects a phase 
change and then searches for the best configu- 
ration [4]. 

It is important to differentiate between number of tun- 
ings and number of reconfigurations. Each tuning can 
possibly be composed of multiple reconfigurations. 
Hence, reducing the number of tunings leads to signifi- 
cantly fewer reconfigurations, less time spent in non- 
optimal configurations, and better performance/power 
efficiency. 

1.3 Configuration algorithms using working set 
analysis 

Because phase changes are manifestations of working 
set changes [17], we consider algorithms based on analysis
of explicit working set information. In Section 2, we
define a working set signature, a lossy-compressed representation
of the true working set. By using working set
signatures to detect phase changes, very accurate configuration
algorithms can be developed. In Section 3, we apply
the working set detection method to variations of the
Rochester algorithm and show that similar average cache
sizes and miss-rates can be achieved with fewer recon- 
figurations in some cases. 

For some multi-configuration units, the optimal con- 
figuration is directly related to working set size. In Sec- 
tion 4, we show that the working set signature can be used 
for estimating size and develop a simple algorithm for 
finding an optimal cache configuration. This algorithm 
significantly reduces reconfigurations. 

Finally, working sets can be used to identify recurring 
phases. Re-tuning is done only when a program phase 
change actually occurs. If the phase has occurred in the 
past, the optimum configuration is looked up in a table 
thereby eliminating the tuning overhead. As far as we 
know, none of the reconfiguration algorithms reported in 
literature exploit knowledge of recurring phases. In Sec- 
tion 5, we propose such an algorithm and show that reuse 
of configuration information can lead to a 95% reduction 
in number of tunings on average for integer benchmarks. 
Section 6 describes the implementation of hardware and 
software required to enable our reconfiguration scheme. 

2. Working with working sets 

For decades, operating system researchers have studied 
working set behavior to optimize memory hierarchy us- 
age, and they have shown that working sets are the cause 
of phase behavior. 

2.1 Basic definitions 

Classically, a working set $W(t_i, \tau)$, for $i = 1, 2, \ldots$, is a set of
distinct segments $\{s_1, s_2, \ldots, s_\omega\}$ touched over the $i$-th window
of size $\tau$ [16]. The working set size is $\omega$, the cardinality
of the set. The segments are typically memory regions of
some fixed size, such as a page.

Following some initial studies of working sets, researchers
focused on more general models of program
behavior and developed the phase transition model [17,
18]. Batson and Madison defined a phase as a maximal
interval during which a given set of segments stay on top
of the LRU stack [18]. In other words, a phase is defined
as the maximum interval over which the working set remains
more or less constant. The phase transition model
states that programs follow a series of steady state phases
with rather abrupt transitions in between. Phase transition
studies have shown that programs have a marked phase
behavior and bigger phases are composed of several
smaller phases.

Most of the early working set research was directed at 
program paging behavior, but as one would expect, simi- 
lar behavior occurs with smaller, cache line-size address- 
ing units that are more in line with applications to config- 
urable hardware. Also, early work tended to lump instruc- 
tions and data together. We distinguish instruction and 
data working sets, and in this paper we focus on the
instruction working set.

As defined, capturing a working set requires a window. 
The window size determines the finest granularity at 
which phases can be resolved. In this paper, we consider
fine grain working sets containing cache line sized elements
(32-256 bytes) because we primarily deal with
multi-configuration units (e.g. caches and predictors) that 
work at this granularity. Also, for design simplicity, a 
series of non-overlapping windows is used, rather than a 
sliding window as is often used in paging studies. 

The method of sampling information is another impor- 
tant parameter. In this paper, we assume that sampling 
occurs at every committed instruction. One could, how- 
ever, resort to periodic sampling or random sampling to 
reduce sampling overhead. This will be an area of future 
research. 

We are interested in identifying working sets, measur- 
ing sizes and detecting changes in working sets. In order 
to do this, we need a measure of similarity because the
same phase may not always touch exactly the same seg- 
ments in each working set window. There is some level of 
noise in the measurements partially due to mismatch in 
the phase and window boundaries and partially due to 
small differences in execution. We define the relative 
working set distance 

$$\delta = \frac{|W(t_i, \tau) \cup W(t_j, \tau)| - |W(t_i, \tau) \cap W(t_j, \tau)|}{|W(t_i, \tau) \cup W(t_j, \tau)|} \qquad (1)$$

to compare two phases with working sets $W(t_i, \tau)$ and
$W(t_j, \tau)$. A large $\delta$ value indicates a working set change
whereas a small $\delta$ indicates no change. At the extreme
ends, $\delta = 0$ when the sets are identical, and $\delta = 1$ when the
working sets are totally different. We define a threshold
$\delta_{th}$ and say there is a working set change if $\delta > \delta_{th}$.
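
As a small illustration of Eq. 1, the following Python sketch computes the relative working set distance for two windows represented as sets. The function name and the sample addresses are hypothetical, chosen only for this example; returning 0 for two empty windows is likewise an assumption of the sketch.

```python
def relative_working_set_distance(w_i, w_j):
    """Relative distance between two working sets, following Eq. 1.

    w_i and w_j are Python sets of segment identifiers
    (e.g. cache-line addresses) touched in two windows.
    """
    union = w_i | w_j
    if not union:
        # Assumption: two empty windows are treated as identical.
        return 0.0
    return (len(union) - len(w_i & w_j)) / len(union)


# Hypothetical cache-line addresses touched in two windows.
window_a = {0x1000, 0x1040, 0x1080, 0x10C0}
window_b = {0x1000, 0x1040, 0x2000, 0x2040}
delta = relative_working_set_distance(window_a, window_b)
```

Here the union holds six lines and the intersection two, so the distance is 4/6, which would count as a working set change under a threshold of 0.5.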

2.2 Working set signatures

Representing and manipulating complete working sets 
is probably impractical for our application. Consequently,
we propose a lossy-compressed working set representa- 
tion that we call the working set signature. 

The working set signature is an n-bit vector formed by 
mapping working set elements into n buckets using a randomizing
hash function (see Fig. 2). As mentioned before,
the working set elements are of cache line granularity and 
hence the low-order b address bits are ignored when hash- 
ing. The size of the bit-vector is in the range of 32 - 128 
bytes. One could consider varying size dynamically to 
suit the application; this however, is a topic of future re- 
search. The bit-vector is cleared at the beginning of every 
interval (window) to remove stale working set informa- 
tion. 
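
The signature collection just described can be sketched in a few lines of Python. This is a minimal software model under stated assumptions, not the hardware mechanism itself: the function name, the 64-byte line size, and the seed-then-draw hash built on Python's `random` module are all choices made for this illustration.

```python
import random


def build_signature(addresses, n_bits=1024, line_bytes=64):
    """Build an n_bits-wide working set signature for one window.

    The signature is returned as a Python int used as a bit vector.
    line_bytes must be a power of two (an assumption of this sketch).
    """
    low_bits = line_bytes.bit_length() - 1   # b low-order bits to ignore
    signature = 0                            # cleared at the start of the window
    for addr in addresses:
        line = addr >> low_bits              # cache-line granularity
        # Illustrative randomizing hash: seed a generator with the line
        # number and draw one bucket index. Not the authors' hash.
        bucket = random.Random(line).randrange(n_bits)
        signature |= 1 << bucket
    return signature


# Three hypothetical instruction addresses that fall in the same 64-byte line.
same_line = build_signature([0x1000, 0x1004, 0x103F])
```

Because all three addresses map to one line, exactly one bit of the signature is set. Distinct lines may still collide in the same bucket, which is the source of the lossy compression.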

Figure 2. Mechanism for collecting working set
signatures. m bits selected from the program
counter are used to address a table containing n
bits. The table is cleared at the beginning of each
window, and a bit is set if the corresponding
instruction block is touched.

Working set signatures can be used to estimate the size,
change, and identity attributes of the full working set. The
size (number of ones) of the signature is probabilistically
related to the true working set size. When K random keys
are hashed into n buckets, the fraction of buckets filled, f,
is given by

$$f = 1 - \left(1 - \frac{1}{n}\right)^{K} \qquad (2a)$$

Given the fraction of the signature filled, the working
set size can be estimated using the relation

$$K = \frac{\log(1 - f)}{\log\left(1 - \frac{1}{n}\right)} \qquad (2b)$$

Using this relation, we find that a 90% filled table cor- 
responds to a working set size about 2.5 times larger than 
the number of filled entries. In Section 3 this relationship 
will be experimentally validated. 
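
The two relations above can be checked numerically with a short Python sketch. The function names are hypothetical, and the 1024-bit table is taken from the experimental setup described later; the sketch only evaluates Eqs. 2a and 2b.

```python
import math


def fill_fraction(k, n):
    """Expected fraction of n buckets filled after hashing k random keys (Eq. 2a)."""
    return 1.0 - (1.0 - 1.0 / n) ** k


def estimate_working_set_size(ones_count, n):
    """Estimate the working set size from the signature ones count (Eq. 2b)."""
    f = ones_count / n
    if f >= 1.0:
        # A fully set signature carries no size information.
        raise ValueError("signature saturated; size cannot be estimated")
    return math.log(1.0 - f) / math.log(1.0 - 1.0 / n)


# A 1024-bit table that is 90% full.
n = 1024
filled = 0.9 * n
ratio = estimate_working_set_size(filled, n) / filled
```

For this table size the ratio comes out at roughly 2.5, in line with the statement above about a 90% filled table.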

To detect working set changes and identities, we use a 
measure of similarity analogous to the one defined above 
for working sets. For two signatures $S_1$ and $S_2$, the relative
signature distance is defined as

$$\Delta = \frac{|S_1 \oplus S_2|}{|S_1 + S_2|} \qquad (3)$$

i.e., (ones count of exclusive OR)/(ones count of inclusive
OR). As with full working sets, we will use a threshold
value $\Delta_{th}$ to detect phase changes.
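
A minimal Python sketch of the relative signature distance follows, with signatures held as integers used as bit vectors. The helper names are hypothetical, and treating two all-zero signatures as identical is an assumption of the sketch.

```python
def popcount(x):
    """Number of one bits in a non-negative integer."""
    return bin(x).count("1")


def relative_signature_distance(s1, s2):
    """Ones count of exclusive OR divided by ones count of inclusive OR."""
    ones_or = popcount(s1 | s2)
    if ones_or == 0:
        # Assumption: two empty signatures are treated as identical.
        return 0.0
    return popcount(s1 ^ s2) / ones_or


# Two small hypothetical signatures sharing two of four set bits.
distance = relative_signature_distance(0b1111, 0b0011)
```

The exclusive OR has two ones and the inclusive OR has four, so the distance is 0.5.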

3. Measuring working set changes 

In this section we use instruction working set signatures 
to detect phase changes (working set changes) and then 
incorporate this mechanism in an example configuration 
algorithm. 



3.1 Methodology 

To evaluate the properties of working set signatures, we 
used a modified version of the SimpleScalar toolset [19]
and a subset of benchmarks from the SPEC 2000 suite. 
The benchmarks were compiled using the base level op- 
timizations. The choice of benchmarks was based on the 
presence of 1) long and short term phases with differing 
performance, 2) recurring phases, to test our working set 
identification scheme, and 3) different working sets in the 
same benchmark that led to similar behavior for certain 
cache/predictor configurations and completely different 
behavior for others - to show variable effectiveness of
reconfiguration.

For collecting working set signatures, a window of
100K instructions is used (unless stated otherwise), and
all benchmarks are run for 20,000 such intervals or 2 billion
instructions. The signature bit vector size for most of
the experiments is 1024 bits (128 bytes); in Section 6.3,
we show that signatures as small as 32 bytes perform
nearly as well. The hash function used during simulation
is based on the C library functions srand and rand.

3.2 Signature accuracy

In order to evaluate the accuracy of working set signa- 
ture distances (as compared with full working sets), we 
measured the relative distances between pairs of consecu- 
tive windows. Fig. 3a is a plot of the relative working set 
distance (y-axis) versus the relative distance for the corresponding
signatures (x-axis). This particular graph is for
gzip, but all the benchmarks display very similar behavior.
That these distances are highly correlated is evident.
There is some slight dispersion due to hash collisions 
when forming signatures. It is clear that using signatures 
for detecting phase changes will be nearly as accurate as 
using full working sets. 

For comparison, the Rochester algorithm uses the dy- 
namic count of conditional branches to measure working 
set changes. We define a relative distance metric for
conditional branch counts in the same way as signature
distances, i.e.,

$$\frac{BR\_CNT_i - BR\_CNT_{i-1}}{BR\_CNT_{i-1}} \qquad (4)$$

where $BR\_CNT_i$ is the conditional branch count for the $i$-th
window. A plot of full working set distances versus the
branch count distances shows some correlation, but with a 
high level of dispersion (Fig. 3b). More importantly, there 
are several significant working set changes that are asso- 
ciated with very small relative branch distances. 

Figure 3. a) Relative working set distance vs.
relative signature distance for benchmark gzip
using a 32-byte signature. b) Relative working
set distance vs. relative branch distance (Eq. 4).

In order to detect a phase change, we need to define the
value of the threshold $\delta_{th}$. The threshold is defined empirically.
Thresholds that are powers of two (0.125, 0.25,
0.5, ...) are used because the implied division for forming
the relative distance becomes a matter of shifting and
comparing. Experiments showed that the ability to detect
phase changes is relatively insensitive to the threshold,
because, as was noted in [17], a phase change tends to be
abrupt and very pronounced. Consequently, a threshold
of 0.5 is used, which filters out most of the noise and detects
only the significant phase changes.
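
The shift-and-compare idea can be made concrete with a short Python sketch. For a threshold of 2^-k, testing whether the ratio of the two ones counts exceeds the threshold is equivalent to comparing the shifted numerator with the denominator, so no division is needed. The function name and the choice to shift the numerator left (rather than the denominator right) are assumptions of this sketch.

```python
def exceeds_threshold(s1, s2, k=1):
    """True if the relative signature distance is greater than 2**-k.

    Signatures are integers used as bit vectors. The test
    ones(xor) / ones(or) > 2**-k is rewritten as
    (ones(xor) << k) > ones(or), which avoids the division.
    """
    ones_xor = bin(s1 ^ s2).count("1")
    ones_or = bin(s1 | s2).count("1")
    return (ones_xor << k) > ones_or


# Hypothetical signatures: distance 0.75, compared against threshold 0.5 (k = 1).
changed = exceeds_threshold(0b1111, 0b0001, k=1)
```

Shifting the numerator left keeps the comparison exact for integer counts; shifting the denominator right would discard low-order bits.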

3.3 Evaluation: managing configurable hardware

In this subsection, we use working set signatures for detecting
phase changes, and incorporate phase change detection
into a reconfiguration algorithm. To illustrate its
performance, it is applied to a multi-configuration instruc- 
tion cache. 

The algorithm we propose is given in Fig. 4 and will be 
referred to as the signature based algorithm. The signa- 
ture size is 128 bytes. This algorithm has three states: 
STABLE - when the program working set is stable and 
the configuration is optimal, UNSTABLE - when the 
working set is in transition and TUNING - when the 
working set is stable and different configurations are be- 
ing explored. 

Figure 4. Basic algorithm based on working set
signatures. The algorithm uses relative signature
distances (represented with # operator) to detect
phase changes and then performs tuning when
the phase transition completes.

At the end of each window (100K instructions), the
relative signature distance with respect to the previous
signature is computed. Assuming the system is initially
STABLE, if the distance is greater than the threshold
(0.5), the state becomes UNSTABLE and subsequent intervals
wait for the distance to go below the threshold,
indicating stability has been restored. When this happens,
the state transitions to TUNING, and the algorithm begins
searching for the optimal configuration. Once the optimal
configuration is found, the state transitions to STABLE.
On the other hand, the state transitions back to
UNSTABLE if the signature distance exceeds the threshold
while TUNING is in progress.
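
The state transitions described in this paragraph can be sketched as a small Python function evaluated once per window. This is an illustrative reading of the prose, not a transcription of Fig. 4: the function name and the `tuning_done` flag (set by the caller once the search over configurations has finished) are hypothetical.

```python
STABLE, UNSTABLE, TUNING = "STABLE", "UNSTABLE", "TUNING"


def next_state(state, distance, tuning_done=False, threshold=0.5):
    """Return the state for the next window.

    distance is the relative signature distance to the previous window.
    tuning_done is a hypothetical flag reporting that the search for the
    optimal configuration has completed.
    """
    changed = distance > threshold
    if state == STABLE:
        return UNSTABLE if changed else STABLE
    if state == UNSTABLE:
        # Wait for the working set to settle before tuning.
        return UNSTABLE if changed else TUNING
    if state == TUNING:
        if changed:
            return UNSTABLE          # phase changed while tuning
        return STABLE if tuning_done else TUNING
    raise ValueError("unknown state: %r" % (state,))


# A hypothetical run: a phase change lasting two windows, then tuning.
state = STABLE
history = []
for d, done in [(0.1, False), (0.8, False), (0.7, False),
                (0.1, False), (0.1, False), (0.1, True)]:
    state = next_state(state, d, tuning_done=done)
    history.append(state)
```

In this run the system stays STABLE, becomes UNSTABLE for two windows, enters TUNING once the distance drops, and returns to STABLE when tuning completes.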

The Rochester algorithm and the signature-based algo- 
rithm are similar in overall structure, but one difference is 
that the signature-based algorithm does not tune while the 
working set is in transition; it waits for the phase to stabi- 
lize. 

To illustrate the algorithm's performance, we consider 
an instruction cache that can be reconfigured to 2KB, 
8KB, 32KB or 128KB, depending on the requirements of 
the program. The goal is to save power by using the 
smallest cache that gives good performance. We use the 
cache miss rate as a measure of performance, the number 
of reconfigurations/tunings as a measure of overhead, and
the average cache size as a measure of power consump- 
tion. 

For comparison we use the Rochester algorithm given 
in Fig. 1, adapted to instruction cache configuration. As 
noted earlier, this algorithm detects phase changes using 
dynamic branch counts. The parameters used for the algorithm
[4, 23] are base_br_noise = 4500, br_dec = 50,
br_inc = 1000, base_perf_noise = 450, perf_dec = 5,
perf_inc = 100 and threshold = 2%.
Figure 5. Average miss rates and cache sizes for SPEC2K floating-point (left) and integer (right) benchmarks.
The last column in each graph shows the average over all the benchmarks in that graph. Results
are shown for the Rochester algorithm, basic signature based and extended signature based algorithms
(Sec. 3.3); signature size based algorithm (Sec. 4.2); phase table based algorithm (Sec. 5.1).

Figure 6. Number of tunings and reconfigurations for SPEC2K floating-point (left) and integer (right)
benchmarks. The last column in each graph shows the average over all the benchmarks in that graph.
Results are shown for the Rochester algorithm, basic signature based and extended signature based
algorithms (Sec. 3.3); signature size based algorithm (Sec. 4.2); phase table based algorithm (Sec. 5.1).

Fig. 5 shows the average cache miss rate and average 
cache size for the Rochester and basic signature-based 
algorithms (first two bars; other bars will be described 
later). On average, all the algorithms perform very simi- 
larly in terms of miss rates and average cache sizes. A 
point to emphasize here is that any algorithm with a suffi- 
cient number of tunings will achieve near-optimal instruc- 
tion cache sizes and miss rates, and in the remainder of 
the paper, we do not draw any real distinctions among 
algorithms on that basis. These results do show the advantage
of the dynamic configuration approach, however. For
example, compared to a configuration having 128KB in- 
struction cache (0% miss rate on average, not shown in 
the figure), the signature-based algorithm reduces average 
cache size by 82% for an increased miss-rate of just 0.4%. 

The number of tunings and reconfigurations (Fig. 6) are 
the key distinguishing features directly related to the algorithm's
performance overhead, and we focus on these
measures in comparing algorithms. Recall that a tuning 
occurs when the algorithm initiates a search for the opti- 
mal configuration; a reconfiguration occurs whenever the 
configuration changes. 

The signature-based algorithm is comparable to the
Rochester algorithm in number of reconfigurations; however,
the Rochester algorithm has the advantage of performing
far fewer tunings. This is mainly because the
Rochester algorithm detects when unnecessary tunings
occur and "backs off" by increasing the noise levels. This
feature is especially useful when there are frequent phase
changes that do not require reconfiguration. On the other
hand, the basic signature-based algorithm performs tunings
every time a phase change is detected; there is no
"backoff".

To reduce unnecessary tunings, we extend the signa- 
ture-based algorithm to wait for 4 stable intervals before 
tuning. Also, if the state is UNSTABLE for more than 10 
intervals and performance is below threshold, the cache 
size is increased to the maximum. This acts as a backup 
strategy in cases where the working set does not stabilize, 
so tuning is never performed. With the extended algorithm,
the number of tunings is reduced by 74% on average,
compared to the basic algorithm (Fig. 6).

4. Measuring working set sizes 

As mentioned earlier, the signature size (ones count of
the signature) is closely related to the actual working set 
size. Thus, in those cases where performance is directly 
related to the working set size, for example instruction 
and data caches, the signature size can be used to deter- 
mine the optimal configuration; there is no need for tun- 
ing. 
Figure 7. Working set size vs. normalized ones
count of the signature for instruction working
set of SPEC2K benchmark gcc. Signature size
used is 128 bytes.

4.1 Working set size experiments 

We collected the true working set and the working set
signature for each window of 100K instructions. Then,
the true size of the working set versus the signature size
was plotted. Since a randomizing hash is used, the graphs
for all the benchmarks are essentially identical (and fit Eq.
2b). A representative plot for the instruction working set 
is shown in Fig. 7. 

As expected, for small working sets, the graph is close
to linear with a slope of 1 and as the working set gets bigger,
the graph becomes non-linear. Even in the non-linear
section, the signature can give reasonably accurate working
set size estimates 3-4x the maximum signature size.
This means that a typical signature size we have been
considering (32-128 bytes) with line-size granularities
(32-128 bytes) can be used to estimate working set sizes
of many tens to hundreds of Kbytes - adequate for reconfiguring
L1 caches. By increasing the granularity (future
research), we expect the reach to be extended to L2 cache
sizes.

4.2 Evaluation: reconfiguration using signature
size 

The extended signature-based algorithm can be modified
to use the signature size for selecting an optimal
cache configuration - the smallest that holds the current
working set (plus 10% to allow for some noise). To determine
the appropriate size, equation 2 (Section 2) is
used. This eliminates the need to tune, and it typically
reduces the number of reconfigurations as well. The main
advantage lies in the significantly smaller number of
reconfigurations (Fig. 6: signature size) - on average, 75-
80% fewer than the Rochester and extended signature
algorithms. The effect is much more prominent in a
benchmark like gzip. Gzip has lots of dynamic phases
with a cache requirement of 8KB, separated by phases
with a requirement of 2KB. When tuning, the Rochester
and signature-based algorithms try the 2KB configuration
before trying the 8KB. On the other hand, the signature-size
based algorithm sets the size to 8KB directly, avoiding
half of the reconfigurations.
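
The direct size selection can be sketched as follows. The four cache sizes, the 1024-bit signature, and the 10% margin come from the text; the 64-byte line size, the function name, and the choice of the largest cache for a saturated signature are assumptions made for this illustration.

```python
import math

CACHE_SIZES_KB = (2, 8, 32, 128)


def choose_cache_size_kb(ones_count, n_bits=1024, line_bytes=64, margin=0.10):
    """Pick the smallest cache that holds the estimated working set plus a margin."""
    f = ones_count / n_bits
    if f >= 1.0:
        # Assumption: a saturated signature selects the largest cache.
        return CACHE_SIZES_KB[-1]
    # Estimated number of distinct lines in the working set (Eq. 2b).
    lines = math.log(1.0 - f) / math.log(1.0 - 1.0 / n_bits)
    needed_bytes = lines * line_bytes * (1.0 + margin)
    for size_kb in CACHE_SIZES_KB:
        if size_kb * 1024 >= needed_bytes:
            return size_kb
    return CACHE_SIZES_KB[-1]


# Hypothetical ones counts observed at the end of four windows.
choices = [choose_cache_size_kb(c) for c in (20, 100, 300, 1024)]
```

Because the size is read directly from the signature, a change from a 2KB phase to an 8KB phase costs one reconfiguration instead of a sequence of trial configurations.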

5. Identifying recurring phases 

As a program executes, it goes through many phase 
changes. However, the same phases often recur multiple 
times during program execution. As far as we know, no
previous work has proposed the saving of recurring phase
information to avoid re-tuning. In this section, we study
such an algorithm. Briefly, this will be done by maintain- 
ing a phase table in memory. After tuning has determined 
the optimal configuration for a particular phase, it will be 
stored in the table. Later, if the phase recurs, the optimal 
configuration can be reinstated without going through the 
tuning process. 

5.1 Phase statistics 

Table 1 shows some general characteristics of phases as 
identified in the simulations. The program execution con- 
sists of a sequence of stable 100K instruction intervals 
separated by unstable intervals. Each "run" of stable intervals
is defined as one dynamic phase. If the relative
signature distance between two different dynamic phases 
is within the 0.5 threshold, we say that they are the same 
static phase. The average phase lengths are computed by 
averaging the lengths of all the dynamic phases. 

In general, the floating-point benchmarks have longer 
phases, typically tens of millions of instructions, primarily
due to the long loops of numerical code. The integer
benchmarks on the other hand have much shorter phases; 
less than one million instructions for gzip and gcc. For 
many of the floating point benchmarks 99+% of the time 
is spent in stable phases. As the average phase length decreases
there are more transitions, and hence the fraction
of time in a stable region decreases (60-80% for gzip and 
gcc). 

The presence of recurring phases is evident in all the 
benchmarks by comparing the number of dynamic phases 
with the number of static phases. However, the degree to 
which phases recur is reduced by the relatively short 
benchmark runtimes (2 billion instructions). Several ben- 
chmarks were run for 10 billion instructions and the 
number of dynamic phases was almost three orders of 
magnitude greater than the number of static phases. This
indicates that the gains of reusing configuration information
for recurring phases increase with time.

Table 1. Benchmark characteristics. Columns
are benchmark name, number of dynamic and
static phases, number of static phases that lead
to 95% of stable time, average phase length in
units of 100K instructions, and the percentage of
time spent in stable phases.

The "95% stable time" column of the table is the num- 
ber of static phases that account for 95% of the dynamic 
phases. These numbers are quite low, fewer than 20 in 
every case. This indicates that a relatively small signature 
table will be sufficient for covering most recurring 
phases. 

5.2 Evaluation: recurring working sets 

The algorithm for exploiting recurring working sets is 
similar to the one given in Fig. 4. However, on detecting a 
phase change, the algorithm first performs a table lookup 
to see if configuration information for the phase exists in
the table. If so, the optimal configuration is reinstated. If 
not, the algorithm goes into the TUNING state. At the end 
of tuning, the optimal configuration is committed to the
signature table. 

In addition to the configuration information, the table 
also keeps track of phase lengths. If, during its last execu- 
tion, the length was fewer than four intervals (400,000 
instructions), then tuning is not performed. This avoids 
tuning for insignificant phases. Four intervals are chosen 
because the tuning process takes a maximum of four intervals.
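
The table-driven decision described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the class and method names (`PhaseTable`, `on_phase_change`, `commit`, `record_length`) are hypothetical, signatures are modeled as integers used as bit vectors, and the handling of a phase that is in the table but has no committed configuration (skip tuning if its last occurrence was shorter than four intervals) is one reading of the text.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


def relative_distance(sig_a: int, sig_b: int) -> float:
    """Ones in (a XOR b) divided by ones in (a OR b); 0.0 if both are empty."""
    union = bin(sig_a | sig_b).count("1")
    return bin(sig_a ^ sig_b).count("1") / union if union else 0.0


@dataclass
class PhaseEntry:
    signature: int
    config: Optional[Any] = None  # optimal configuration; None until tuning completes
    last_length: int = 0          # length of the last occurrence, in intervals


class PhaseTable:
    def __init__(self, threshold: float = 0.5, min_intervals: int = 4):
        self.threshold = threshold
        self.min_intervals = min_intervals
        self.entries: List[PhaseEntry] = []

    def lookup(self, sig: int) -> Optional[PhaseEntry]:
        """Linear search for the first stored phase within the threshold."""
        for entry in self.entries:
            if relative_distance(sig, entry.signature) <= self.threshold:
                return entry
        return None

    def on_phase_change(self, sig: int) -> Tuple[str, Optional[Any]]:
        """Return ("reuse", config), ("skip", None), or ("tune", None)."""
        entry = self.lookup(sig)
        if entry is None:
            self.entries.append(PhaseEntry(sig))
            return ("tune", None)
        if entry.config is not None:
            return ("reuse", entry.config)
        if 0 < entry.last_length < self.min_intervals:
            return ("skip", None)  # phase was too short last time to be worth tuning
        return ("tune", None)

    def commit(self, sig: int, config: Any) -> None:
        """Store the configuration chosen at the end of tuning."""
        entry = self.lookup(sig)
        if entry is not None:
            entry.config = config

    def record_length(self, sig: int, intervals: int) -> None:
        """Remember how many intervals the phase lasted this time."""
        entry = self.lookup(sig)
        if entry is not None:
            entry.last_length = intervals
```

A caller would invoke `on_phase_change` when a working set change is detected, run the tuning process only on a "tune" result, and call `commit` and `record_length` when the phase ends.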

The results for the algorithm are shown in Figs. 5 and 
6, labeled phase table. The important difference lies in the 
number of tunings performed by the phase table algorithm.
The algorithm performs 67% fewer tunings for
floating point benchmarks and 92% fewer tunings for the
integer benchmarks compared with the extended signature
based algorithm. In situations where the tuning process is
complex, this can lead to significant improvements in 
performance. 

6. Implementation 

To implement configuration algorithms, we propose a 
combination of hardware and software. Software per- 
forms higher-level configuration decisions, and hardware 
collects working set signatures, and, possibly, performs 
some of the lower level analysis. 

6.1 VMM based configuration management 

To perform working set analysis and manage configur- 
able hardware of wide variety and complexity, we are 
developing a co-designed virtual machine monitor 
(VMM) [16] - a layer of software designed concurrently
with the hardware implementation. This software is hidden
from all conventional software and would typically
be developed as part of the hardware design effort. The
base technology is used in the Transmeta Crusoe [20] and
the IBM Daisy/BOA projects [21] primarily to support
whole-system binary translation. In this work, we are not
interested in the binary translation aspect. In fact, for
managing configurable hardware, no changes need to be
made to existing binaries.

Of course, VMM software is not the only option for 
managing the optimization process. Low-level operating 
system software could also be used. This, however, re- 
quires the addition of implementation dependent code to 
the OS. One could also consider microcode in place of 
VMM software. The microcode can reside in ROM, but 
there must still be some hidden memory for maintaining 
data structures such as the phase table. A special purpose 
co-processor [22] is another good candidate for managing
the hardware configuration. It has the advantage of saving
optimization time overhead at the expense of additional 
hardware. 

In the most straightforward implementation, working 
set signatures are collected by hardware, and then the raw
signature data is read and analyzed by VMM software. 
The working set size/difference algorithms we propose 
can easily be performed in software. With the assumed 
window size, VMM software is invoked every 100K instructions.
Because in most cases the relative signature
distance will be very small, the VMM overhead will also 
be small - probably a few tens of instructions. If this 
overhead is still too high, a longer sampling interval can 
be used, or hardware can be used to perform some of the 
low level analysis. This is described in the next subsec- 
tion. 

A phase table lookup ostensibly requires a linear search 
of signatures, but it can be made more efficient by using 
techniques such as hashing based on the signature size,
early exits when the phase is the same as the previous one,
etc. This will be a topic for future study as the VMM is
implemented.
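
One way such a lookup might be pruned is sketched below. This is our own illustration, not a design from the paper: the function name `find_phase` and the list-based table are hypothetical. The size filter relies on a simple bound: if signatures A and B contain a and b ones with a >= b, their intersection has at most b ones and their union at least a, so the relative distance is at least 1 - b/a. Two signatures can therefore be within threshold t only if b >= (1 - t) * a, which for t = 0.5 means their sizes differ by at most a factor of two.

```python
def popcount(sig: int) -> int:
    """Number of ones in a signature modeled as an integer bit vector."""
    return bin(sig).count("1")


def relative_distance(sig_a: int, sig_b: int) -> float:
    """Ones in (a XOR b) divided by ones in (a OR b); 0.0 if both are empty."""
    union = popcount(sig_a | sig_b)
    return popcount(sig_a ^ sig_b) / union if union else 0.0


def find_phase(sig, table, previous=None, threshold=0.5):
    """Return the index of a stored signature within the threshold, or None.

    table:    list of stored signatures
    previous: index of the phase seen most recently, if any
    """
    # Early exit: the new phase is often the same as the previous one.
    if previous is not None and relative_distance(sig, table[previous]) <= threshold:
        return previous
    size = popcount(sig)
    for idx, stored in enumerate(table):
        small, large = sorted((size, popcount(stored)))
        if small < (1 - threshold) * large:
            continue  # sizes too different for the distance to be within threshold
        if relative_distance(sig, stored) <= threshold:
            return idx
    return None
```

A real table would presumably bucket entries by size rather than scanning a list, so that only buckets satisfying the bound are visited; the linear scan here only shows that the filter never rejects a true match.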

6.2 Hardware working set analysis

Besides collecting working set signatures, hardware can 
also be used for estimating working set size and/or to detect
working set changes, thereby reducing software overhead.
In particular, detecting working set changes in
hardware avoids invoking the VMM between each inter- 
val; the VMM has to be invoked only when the working 
set actually changes. Furthermore, for very simple recon- 
figurations that are directly related to working set size 
(e.g. cache configurations), it may not be necessary to 
enter the VMM at all; hardware can determine the proper 
configuration based only on the size of the working set 
signature. It is important to emphasize that this hardware
is not on the critical path and hence can be implemented 
with slow, low power transistors. 

To measure size, there must be a hardware counter 
which increments whenever a bit in the signature changes 
from 0 to 1. This requires reading the signature entry before
writing to it.
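
The read-before-write counter can be modeled in a few lines of Python. This is a behavioral sketch only: the class name `CountingSignature` is hypothetical, the default of 1024 bits corresponds to the 128-byte signature discussed in Section 6.3, and the bit index is assumed to come from hashing the instruction address elsewhere (the hash itself is not modeled here).

```python
class CountingSignature:
    """Bit-vector signature with a counter of set bits (the working set size estimate)."""

    def __init__(self, num_bits: int = 1024):
        self.bits = [0] * num_bits
        self.ones = 0  # models the hardware counter

    def record(self, index: int) -> None:
        """Set one signature bit, counting only 0 -> 1 transitions."""
        if self.bits[index] == 0:  # read the entry before writing it
            self.bits[index] = 1
            self.ones += 1

    def reset(self) -> None:
        """Clear the signature and counter at the start of a new interval."""
        self.bits = [0] * len(self.bits)
        self.ones = 0
```

Repeated accesses to the same entry leave the counter unchanged, so `ones` always equals the number of set bits without a separate population count at the end of the interval.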

To measure the relative signature distance, a second 
signature register is required to hold the signature for the 
previous window. As defined in section 2.2, the relative
signature distance is the ratio of the exclusive-OR to the
inclusive-OR of the signatures - say X/N. X and N can be
evaluated dynamically as follows.

Initially, X = N = count of ones in the previous signature.
For each signature access, both the previous and current
signature values are read. If previous=0 and current=0,
both X and N are incremented. If previous=0 and current=1,
nothing is done; if previous=1 and current=0, then
the bit in the previous signature is cleared and X is decremented;
the case previous=1 and current=1 should
never happen. Then at the end of the interval, hardware
can find the relative signature distance X/N (or approximate
it by shifting and comparing, when the threshold is a
power of two). The VMM can set up the hardware to trap
to VMM software on values above the threshold.
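
The update rules for X and N can be checked with a small software model. This is a sketch for illustration, not a hardware description: the function names are hypothetical, signatures are Python lists of 0/1 values, and each access is assumed to arrive as an already-hashed bit index. The shift-and-compare test shown is for a threshold of 1/2, where X/N > 1/2 is equivalent to 2X > N.

```python
def track_distance(prev_bits, accesses):
    """Compute X and N incrementally over one interval.

    prev_bits: signature of the previous window (modified in place, as in the text)
    accesses:  iterable of bit indices touched during the current window
    Returns (x, n, cur_bits).
    """
    cur_bits = [0] * len(prev_bits)
    x = n = sum(prev_bits)  # initially X = N = ones in the previous signature
    for i in accesses:
        prev, cur = prev_bits[i], cur_bits[i]
        if prev == 0 and cur == 0:
            x += 1  # element is new in this window
            n += 1
        elif prev == 1 and cur == 0:
            prev_bits[i] = 0  # element is shared by both windows
            x -= 1
        # prev == 0 and cur == 1: already counted, nothing to do
        cur_bits[i] = 1
    return x, n, cur_bits


def exceeds_half(x: int, n: int) -> bool:
    """X/N > 1/2 without a divide: shift X left by one and compare."""
    return (x << 1) > n


if __name__ == "__main__":
    prev = [1, 1, 1, 0, 0, 0, 0, 0]
    x, n, cur = track_distance(prev, [0, 1, 3, 3, 0])
    print(x, n, exceeds_half(x, n))
```

In the example the previous window set bits {0, 1, 2} and the current window sets {0, 1, 3}, so the exclusive-OR has two ones and the inclusive-OR has four, matching the incrementally maintained X and N.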

6.3 Implementation cost

The primary cost is the working set signature. This 
consists of 128 bytes and can be placed off the critical 
path. Using smaller signatures can further reduce the
hardware cost. Fig. 8 shows that a signature as small as 32
bytes can resolve most of the dynamic phases resolved by
a 512-byte signature. Small signatures are unable to resolve
certain phase changes for benchmarks with large
working sets (perl and gcc) due to collisions in the signature
table, which lead to smaller relative distances. Preliminary
experiments have shown that using smaller
thresholds is a solution. Dynamically varying thresholds
and/or signature sizes to accommodate larger working
sets is a topic of future research.

In the simple implementation (where the VMM performs
the relative distance computation) the memory only
has to be written in normal operation. Furthermore, it is
not critical that every instance of an element of the working
set be recorded. Only one occurrence of the element
has to be recorded and most elements appear multiple
times. Thus, if occasionally dropping an element simplifies
hardware (for example, retiring instructions from
different cache blocks in one cycle) little accuracy is lost.
For determining the relative signature distance in
hardware, two copies of the signature memory are needed,
and they are both read and written during the collection
phase.
[Figure caption fragment: ... signatures of sizes 256-4096 bits (32-512 bytes).]

7. Related work 

Previous work related to hardware reconfiguration was 
discussed in Sec. 1.1. In this section, we briefly discuss
work related to working set analysis. 

Sherwood et al. [24] proposed the use of program phase
information to speed up simulation. They use basic block
execution frequency information as a fingerprint for an
interval of execution. The goal, then, is to find a small set
of intervals whose fingerprint matches that of the whole
program. Detailed simulation over these intervals can 
give a fairly accurate estimate of the performance of the 
whole program. 

Adaptive mode control (AMC) caches, proposed by
Zhou et al. [25], keep track of the working set in order to
enable/disable cache lines. The AMC cache keeps a
counter for each of the tags to measure activity. If the
cache line is not accessed for a particular interval, then it
is put to sleep. However, the corresponding tag entry is
not put to sleep, thereby allowing continuous monitoring
of the working set and avoiding "just-in-case" periodic
upsizing.

HP Dynamo [25], a run time dynamic optimization sys- 
tem, uses a measure of working set change to flush stale 
data translations from a cache. Dynamo optimizes traces 
of the program to generate fragments, which are stored in
a fragment cache. At steady state, most of the instructions
are fetched from the fragment cache. When the working 
set changes, the rate of fragment formation increases. This 
is used as a trigger to flush stale fragments from the 
cache, making room for the new ones. 

Merten et al. [27] describe a framework for dynamic
optimization, which profiles branches to detect working 
set hot-spots. This is mainly done using a branch behavior 
buffer, which collects frequently executed branches. The 
hot-spot information can be fed into a run-time optimizer 
such as Dynamo to achieve performance improvements.

8. Conclusions and future research 

We introduced the concept of a working set signature, a
lossy-compressed representation of the program working 
set. The signatures provide a robust mechanism for de- 
tecting working set changes. Also, unlike previously re- 
ported methods, the signatures can be used to identify 
specific working sets. This provides an opportunity for 
storing configuration information associated with recur- 
ring working sets. Algorithms using complex tuning 
mechanisms can benefit significantly from reuse of con- 
figuration information. 

When applied to an instruction cache reconfiguration 
algorithm, the signatures detect most of the major work- 
ing set changes. This algorithm achieves 27% fewer tun- 
ings and 18% fewer reconfigurations than the Rochester 
algorithm - probably the best published to date. 

Working set size information can be derived from the 
signature and can be used to configure the instruction 
caches directly. An algorithm based on this achieves performance
similar to the signature-based algorithm using
74% fewer reconfigurations. 

Finally, an algorithm based on reuse of configuration 
information leads to 80% fewer reconfigurations com- 
pared to the Rochester algorithm. These results suggest 
that an algorithm based on reuse of configuration information
can potentially perform much better than other
algorithms when the tuning overhead is high. 

We plan to continue the development of a VMM that 
implements these algorithms. This development will include:

• Algorithms for tuning multiple interacting units in 
a way that optimizes performance and/or power 
efficiency. The work in [12] is an important first 
step in this direction. 



• Study of the relationship between the signature 
size, the PC bits, sample interval and thresholds. It 
is likely that the VMM can adjust the PC bits and 
sample interval dynamically to adapt to working 
set size. 

• Study of sampling schemes such as periodic sampling,
to reduce sampling overhead.

• Study of algorithms for building and managing the 
signature table. In particular, it will be necessary to 
develop fast algorithms for searching the table to
find recurring phases. 

The ultimate goal is to define the overall VMM structure
and to apply it to a highly configurable microarchitecture.
[...] Configurable Hardware via Dynamic Working Set Analysis
Ashutosh S. Dhodapkar, James E. Smith
